Project: Investigate a IMDB Dataset EDA

This project was made as a part of the Egypt FWD of 2020

Author : Omar Ossama Mahmoud Ahmed

Table of Contents

  • Introduction
  • - Introduction and imposing questions.
  • Data Wrangling
  • - Data Investigation. - Data Cleaning & manipulation. - Reviewing formating & Structure Issues.
  • Exploratory Data Analysis
  • - Correlation overview. - Defining the first question. - Defining the second question. - Defining the third question. - Defining the fourth question. - Defining the fifth question.
  • Conclusions
  • - General conclusion on questions and insights. - Data Limitations. - Future suggestions for dataset.

Sources:

Introduction

In this project we are going to clean, investigate, explore and analyse IMDB Movies Dataset for correlations as well as extract some insights and trends from the valid data. The main objective of this project is to answer 5 questions related to this dataset.

  • Firstly, Does budget affect the movie's popularity?
  • Secondly, What is the most popular genre within most popular movies?
  • Thirdly, does high popularity result in high revenue?
  • Fourthly, who are the top 10 most commen actors of that period?
  • and finally, who are the top 10 most commen directors of that period?

Imports and Datasets


  • Pandas : for dataset handeling
  • Numpy : Support for Pandas and calculations
  • Datetime: for date and times calculations
  • Matplotlib : for visualization (basic)
  • plotly : for interative plots
In [1]:
# Importing libraries in use

import pandas as pd
import numpy as np
import datetime as dt
import sklearn.preprocessing
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

Defining Functions


  • ecdf() : for CDF calculation
In [2]:
# Defining ecdf Function for Cumilative Distribution Function
def ecdf(data):
    #credits DataCamp Justin Bois
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)

    # x-data for the ECDF: x
    x = np.sort(data)

    # y-data for the ECDF: y
    y = np.arange(1, n+1) / n

    return x, y

Data Wrangling

Importing and exploring the dataset

In [3]:
# Importing the dataset using pandas

df = pd.read_csv('https://d17h27t6h515a5.cloudfront.net/topher/2017/October/59dd1c4c_tmdb-movies/tmdb-movies.csv')

display(df.head())
display(df.info())
display(df.describe())
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. ... An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08
2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke One Choice Can Destroy You ... Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 3/18/15 2480 6.3 2015 1.012000e+08 2.716190e+08
3 140607 tt2488496 11.173104 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... http://www.starwars.com/films/star-wars-episod... J.J. Abrams Every generation has a story. ... Thirty years after defeating the Galactic Empi... 136 Action|Adventure|Science Fiction|Fantasy Lucasfilm|Truenorth Productions|Bad Robot 12/15/15 5292 7.5 2015 1.839999e+08 1.902723e+09
4 168259 tt2820852 9.335014 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... http://www.furious7.com/ James Wan Vengeance Hits Home ... Deckard Shaw seeks revenge against Dominic Tor... 137 Action|Crime|Thriller Universal Pictures|Original Film|Media Rights ... 4/1/15 2947 7.3 2015 1.747999e+08 1.385749e+09

5 rows × 21 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date          10866 non-null  object 
 16  vote_count            10866 non-null  int64  
 17  vote_average          10866 non-null  float64
 18  release_year          10866 non-null  int64  
 19  budget_adj            10866 non-null  float64
 20  revenue_adj           10866 non-null  float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB
None
id popularity budget revenue runtime vote_count vote_average release_year budget_adj revenue_adj
count 10866.000000 10866.000000 1.086600e+04 1.086600e+04 10866.000000 10866.000000 10866.000000 10866.000000 1.086600e+04 1.086600e+04
mean 66064.177434 0.646441 1.462570e+07 3.982332e+07 102.070863 217.389748 5.974922 2001.322658 1.755104e+07 5.136436e+07
std 92130.136561 1.000185 3.091321e+07 1.170035e+08 31.381405 575.619058 0.935142 12.812941 3.430616e+07 1.446325e+08
min 5.000000 0.000065 0.000000e+00 0.000000e+00 0.000000 10.000000 1.500000 1960.000000 0.000000e+00 0.000000e+00
25% 10596.250000 0.207583 0.000000e+00 0.000000e+00 90.000000 17.000000 5.400000 1995.000000 0.000000e+00 0.000000e+00
50% 20669.000000 0.383856 0.000000e+00 0.000000e+00 99.000000 38.000000 6.000000 2006.000000 0.000000e+00 0.000000e+00
75% 75610.000000 0.713817 1.500000e+07 2.400000e+07 111.000000 145.750000 6.600000 2011.000000 2.085325e+07 3.369710e+07
max 417859.000000 32.985763 4.250000e+08 2.781506e+09 900.000000 9767.000000 9.200000 2015.000000 4.250000e+08 2.827124e+09

For targeted analysis in question, certain columns are irrelevent and pose problems with missing data, thus dropping said columns is preferable for the analysis process


Data Cleaning

Dropping irrelavent columns

In [4]:
# Dropping columns that pose problems and are irrelavent

columns_todrop = ['id','imdb_id','homepage','tagline','keywords','production_companies','budget_adj','revenue_adj']

df.drop(columns_todrop,axis=1,inplace=True)
In [5]:
# Dropping missing data to further clean the set

df.dropna(inplace=True)

Reformating Data

Reformatting cast, genere and directors columns as they contain a concatenated list of elements that needs to be split into a list

In [6]:
# splitting strings and converting them to strings

for i in df.index:
    df.at[i,'cast'] = df.cast[i].split(sep='|')
    df.at[i,'genres'] = df.genres[i].split(sep='|')
    df.at[i,'director'] = df.director[i].split(sep='|')
    
# converting date to datetime format

df.release_date = pd.to_datetime(df.release_date, format='%m/%d/%y')

# fixing pandas datetime issue with dates formated as %YY befor 1969

df['release_date'] = df['release_date'].mask(df['release_date'].dt.year > 2015, 
                             df['release_date'] - pd.Timedelta(36500, unit = 'D'))

# resetting index

df.reset_index(inplace=True,drop=True)

Checking final Data Format

In [7]:
display(df.info())
display(df.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10730 entries, 0 to 10729
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   popularity      10730 non-null  float64       
 1   budget          10730 non-null  int64         
 2   revenue         10730 non-null  int64         
 3   original_title  10730 non-null  object        
 4   cast            10730 non-null  object        
 5   director        10730 non-null  object        
 6   overview        10730 non-null  object        
 7   runtime         10730 non-null  int64         
 8   genres          10730 non-null  object        
 9   release_date    10730 non-null  datetime64[ns]
 10  vote_count      10730 non-null  int64         
 11  vote_average    10730 non-null  float64       
 12  release_year    10730 non-null  int64         
dtypes: datetime64[ns](1), float64(2), int64(5), object(5)
memory usage: 1.1+ MB
None
popularity budget revenue original_title cast director overview runtime genres release_date vote_count vote_average release_year
0 32.985763 150000000 1513528810 Jurassic World [Chris Pratt, Bryce Dallas Howard, Irrfan Khan... [Colin Trevorrow] Twenty-two years after the events of Jurassic ... 124 [Action, Adventure, Science Fiction, Thriller] 2015-06-09 5562 6.5 2015
1 28.419936 150000000 378436354 Mad Max: Fury Road [Tom Hardy, Charlize Theron, Hugh Keays-Byrne,... [George Miller] An apocalyptic story set in the furthest reach... 120 [Action, Adventure, Science Fiction, Thriller] 2015-05-13 6185 7.1 2015
2 13.112507 110000000 295238201 Insurgent [Shailene Woodley, Theo James, Kate Winslet, A... [Robert Schwentke] Beatrice Prior must confront her inner demons ... 119 [Adventure, Science Fiction, Thriller] 2015-03-18 2480 6.3 2015
3 11.173104 200000000 2068178225 Star Wars: The Force Awakens [Harrison Ford, Mark Hamill, Carrie Fisher, Ad... [J.J. Abrams] Thirty years after defeating the Galactic Empi... 136 [Action, Adventure, Science Fiction, Fantasy] 2015-12-15 5292 7.5 2015
4 9.335014 190000000 1506249360 Furious 7 [Vin Diesel, Paul Walker, Jason Statham, Miche... [James Wan] Deckard Shaw seeks revenge against Dominic Tor... 137 [Action, Crime, Thriller] 2015-04-01 2947 7.3 2015

Exploratory Data Analysis

Correlation doesn't imply causation!

In [8]:
df.corr().style.background_gradient(cmap='Blues')
Out[8]:
popularity budget revenue runtime vote_count vote_average release_year
popularity 1.000000 0.544214 0.662833 0.138205 0.800613 0.217847 0.093104
budget 0.544214 1.000000 0.734453 0.192101 0.632046 0.087181 0.119107
revenue 0.662833 0.734453 1.000000 0.164264 0.790885 0.178452 0.059102
runtime 0.138205 0.192101 0.164264 1.000000 0.164947 0.177145 -0.119130
vote_count 0.800613 0.632046 0.790885 0.164947 1.000000 0.260520 0.110360
vote_average 0.217847 0.087181 0.178452 0.177145 0.260520 1.000000 -0.127670
release_year 0.093104 0.119107 0.059102 -0.119130 0.110360 -0.127670 1.000000
Correlation shows:
  • A strong positive correlation between budget, revenue, popularity and vote_count (Which to be expected).
  • A moderate positive correlation between popularity and runtime columns.
  • A weak negative between Vote Average and release year columns.

Does budget affect the movie's popularity?

In [9]:
# using plot.ly dynamic library to plot the relationship between revenue and budget with popularity as the third dimension (color)

# subsetting only successful movies where revenue > budget
df_rev = df[df.revenue > df.budget]

fig = px.scatter(data_frame=df_rev,x='revenue',y='budget',color='popularity',hover_data=['original_title'],title = 'Relationship between Revenue & Budget')
fig.show()
In [10]:
#Using ecdf to compute the CDF
x1,y1 = list(ecdf(df_rev.budget))
x,y = list(ecdf(df_rev.revenue))

#Create a subplot to fit two axis
fig = make_subplots(rows=1, cols=2,subplot_titles=(f'Cumulative distribution function of Budget', f'Cumulative distribution function of revenue'))

#add first plot of Budget
fig.add_trace(
    go.Scatter(x= x1,y = y1,name = f'CDF of Budget'),
    row=1, col=1
)

#add second plot of Revenue
fig.add_trace(
    go.Scatter(x = x,y = y,name = f'CDF of Revenue'),
    row=1, col=2
)

#control title and figure dimentions
fig.update_layout(height=500, width=1000, title_text="Cumulative distribution functions")
fig.show()

Conclusions :


  • Relationship between Revenue & Budget shows that extream high budgets doesn't necessarily reflect high revenue.
  • Highest revenue is mid-ranged budget movie yet above average.
  • High popularity seems to increase revenue yet not significantly.
  • Relative to the data given, extream popular movies and movies with high revenue are considered outlires.
  • CDF shows that 78% of the dataset movies has revenue at 150M or less, where the scatter plot shows significant popular movies starting at 160M Revenue, which further implies that around 20% of the movies are above this threshold.
  • CDF also shows that 80% of the movies has budgets less than 50M with only 20% occupying the range from 50M to 300M.
In [11]:
# Subsetting top 10 most popular movies
df_temp_gen = df.nlargest(10,'popularity').explode('genres')

# Counting genres within top 10 popular movies
df_temp_gen['count'] = df_temp_gen.groupby('genres')['genres'].transform('count')

# Sorting and removing duplicates
top_genres = df_temp_gen[['genres','count']].drop_duplicates().sort_values('count',ascending=False).nlargest(10,'count').reset_index()

# Displaying results
display(top_genres[['genres','count']])
genres count
0 Adventure 9
1 Science Fiction 9
2 Action 7
3 Thriller 5
4 Drama 1
5 Fantasy 1

Conclusions :


  • Between the top 10 most popular movies from 1960 till 2015, Adventure and Sci-Fi are the most common genres.

Change in most commen genres over time

In [12]:
# Grouping df by genres by each year in TEMP df
df_temp = df.explode('genres')
df_temp['count'] = df_temp.groupby('genres')['genres'].transform('count')

# Listing top 10 most common genres of all time
top_gen = df_temp[['genres','count']].drop_duplicates().sort_values('count',ascending=False).nlargest(10,'count').reset_index()
In [13]:
# Subsetting dataframe to contain only used columns
df_temp = df_temp[['release_year','genres','count']].groupby(['release_year','genres']).count().reset_index()

# Subsetting dataframe to contain only most common genres of all time
df_temp = df_temp[df_temp['genres'].isin(top_gen['genres'])]

# Plotting relationship
fig = px.line(df_temp,x = 'release_year', y = 'count',color = 'genres',title=f"Change in top 10 genres' popularity over time")
fig.show()

Conclusions :


  • Contradicting with the previous section, a lineplot showing the top 10 most common genres and their change through the years.
  • It suggests that Drama is the most 'common' genre of all time yet not necessarily the most popular.

Does high popularity result in high revenue?

In [14]:
# Subsetting successful movies
df_rev = df[df.revenue > df.budget]

# Plotting the relationship between revenue and popularity with budget as the third dimension (color)
fig = px.scatter(data_frame=df_rev,x='revenue',y='popularity',color='budget',hover_data=['original_title'],title = 'Relationship between Revenue & Popularity')
fig.show()
In [15]:
#Using ecdf to compute the CDF
x1,y1 = list(ecdf(df_rev.popularity))
x,y = list(ecdf(df_rev.revenue))

#Create a subplot to fit two axis
fig = make_subplots(rows=1, cols=2,subplot_titles=(f'Cumulative distribution function of Popularity', f'Cumulative distribution function of revenue'))

#add first plot of Popularity
fig.add_trace(
    go.Scatter(x= x1,y = y1,name = f'CDF of Popularity'),
    row=1, col=1
)

#add second plot of Revenue
fig.add_trace(
    go.Scatter(x = x,y = y,name = f'CDF of Revenue'),
    row=1, col=2
)

#control title and figure dimentions
fig.update_layout(height=500, width=1000, title_text="Cumulative distribution functions")
fig.show()

Conclusions :


  • We explored the relationship between the movie's revenue x budget x popularity using different perspective as budget in third dimension (color)
  • It suggests that most popular movies had midrange budgets yet above average and relatively high revenue.
  • CDF shows that 78% of the dataset movies has revenue at 150M or less, where the scatter plot shows significant popular movies starting at 160M Revenue, which further implies that around 20% of the movies are above this threshold.
  • CDF also shows that 95% of the movies has popularities less than 3.5 with only 5% occupying the range from 4 to 32.

Who are the top 10 most commen actors of that period?

In [16]:
# Subsetting and exploding 'cast' lists
df_temp2 = df.explode('cast')

# Grouping and counting each actor occurence in a movie
df_temp2['count'] = df_temp2.groupby('cast')['cast'].transform('count')

# Sorting and removing duplicates
df_temp2 = df_temp2[['cast','count']].drop_duplicates().sort_values('count',ascending=False).reset_index().nlargest(10,'count')

# Displaying results
display(df_temp2[['cast','count']])
cast count
0 Robert De Niro 72
1 Samuel L. Jackson 71
2 Bruce Willis 62
3 Nicolas Cage 61
4 Michael Caine 53
5 Robin Williams 51
6 John Cusack 50
7 John Goodman 49
8 Morgan Freeman 49
9 Susan Sarandon 48

Conclusions :


  • Between the top 10 most common actors from 1960 till 2015, the 2 oscar winner Robert De Niro is on top with 72 appearances.

Who are the top 10 most commen directors of that period?

In [17]:
# Subsetting and exploding 'cast' lists
df_temp3 = df.explode('director')

# Grouping and counting each director occurence in a movie
df_temp3['count'] = df_temp3.groupby('director')['director'].transform('count')

# Sorting and removing duplicates
df_temp3 = df_temp3[['director','count']].drop_duplicates().sort_values('count',ascending=False).reset_index().nlargest(10,'count')

# Displaying results
display(df_temp3[['director','count']])
director count
0 Woody Allen 46
1 Clint Eastwood 34
2 Martin Scorsese 30
3 Steven Spielberg 30
4 Steven Soderbergh 23
5 Ridley Scott 23
6 Ron Howard 22
7 Joel Schumacher 21
8 Brian De Palma 20
9 David Cronenberg 19

Conclusions :


  • Between the top 10 most common directors from 1960 till 2015, the 4 times oscar winner Woody Allen is on top with 46 Movies directed.

Conclusions

First section;

  • We visitied a statistical measure .corr(), which shows a breif correlation matrix between data features.
  • It suggests that there is a correlation between the popularity of a movie and the vote count at 80% correlation.
  • Which is believable yet we need to keep in mind that correlation does not necessarily mean causation.

Second section;

  • We explored the relationship between the movie's revenue x budget x popularity
  • It suggests that too high of a budget doesn't result in high popularity nor revenue.
  • Most popular movies had a budget between 150M and 250M.

Third section;

  • Between the top 10 most popular movies from 1960 till 2015, Adventure and Sci-Fi are the most common genres.

Fourth section;

  • Contradicting with the previous section, a lineplot showing the top 10 most common genres and their change through the years.
  • It suggests that Drama is the most 'common' genre of all time yet not necessarily the most popular.

Fifth section;

  • We explored the relationship between the movie's revenue x budget x popularity using different perspective as budget in third dimension (color)
  • It suggests that most popular movies had midrange budgets yet above average and relatively high revenue.

Sixth section;

  • Between the top 10 most common actors from 1960 till 2015, the 2 oscar winner Robert De Niro is on top with 72 appearances.

Seventh section;

  • Between the top 10 most common directors from 1960 till 2015, the 4 times oscar winner Woody Allen is on top with 46 Movies directed.

Data Limitations:

  • Inconsistant data formating in dates form a great limitation in data sorting and time-series analysis yet could be cleaned.
  • Missing Data in production companies makes analysis of said column almost undoable unless data is scrapped from the web.
  • Uncollected data such as user reviews (textual data), Shooting location, production country and Series relationships (trilogies and movie parts).
  • Such missing data limits the analysis regarding countries and their revenues and touristic features with regard to popular movies filmed there, as well as user personal reviews on movies which paves the way to NLP and customized analysis of their most used words per user per movie.

Recommendations for future analysis:

(Feature Engineering)

  • Profit column could be calculated from budget and revenue columns giving a relativly new perspective to the dataset.
  • A (win/lose) column could be calculated from profit column for ease of classification. - worked around in section two-
  • An age column could be calculated as well as era column which paves the way for analysis of cinematography over time and difference in popularity of movies over time.